Fault Tolerance
Review: write about fault tolerance problem being addressed, high level technique: replication (where and how), re-execution, etc.
Comments from review:
- questioning assumptions about trends (e.g. self configuring systems, hardware getting more reliable)
o QUESTION: What is really happening?
- how accurate is the data? It came from one system
o QUESTION: what do you expect?
i. MTTF
i. Availability = MTTF / (MTTF + MTTR)
ii. How do you make a system highly available?
1. Make MTTF big (highly reliable) or MTTR small (fast to repair)
iii. 99% available: ~3.65 days of downtime per year
iv. 99.9%: ~9 hours
v. 99.99%: ~1 hour
vi. 99.999%: ~5 minutes
vii. 99.9999%: ~30 seconds
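The availability formula and the downtime figures above can be checked with a few lines of arithmetic. This is a minimal sketch; the function names are made up for illustration, and a year is assumed to be 365 days.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumption: ignores leap years


def availability(mttf, mttr):
    """Fraction of time the system is up; mttf and mttr must share a unit."""
    return mttf / (mttf + mttr)


def downtime_per_year(avail):
    """Expected seconds of downtime per year at the given availability."""
    return (1.0 - avail) * SECONDS_PER_YEAR


if __name__ == "__main__":
    for avail in (0.99, 0.999, 0.9999, 0.99999, 0.999999):
        hours = downtime_per_year(avail) / 3600
        print(f"{avail:.4%} available -> {hours:.2f} hours of downtime per year")
```

Note that halving MTTR improves availability about as much as doubling MTTF, which is why fast repair is a legitimate alternative to high reliability.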
i. Brokerage: $6,000,000
ii. eBay: $225,000
iii. Cell phone activation: $41,000
iv. Home shopping channel: $113,000
i. 900,000 hours – about 100 years
i. Windows 2000: 72 weeks
i. Terminology:
1. Fault = bug in code
2. Error = erroneous state as a result of executing code
a. Latent errors: executed fault but did not cause failure yet
3. Failure = system does not act according to its specification
ii. Types
1. Bohr bugs / deterministic bugs:
a. Bugs that recur every time you do something – easily repeatable / predictable / can be tracked down and fixed / often found in testing
2. Heisenbugs / nondeterministic bugs
a. Bugs that don't recur every time / caused by an unlikely combination of events / hard to reproduce and repair
iii. Causes of failure
1. Hardware (cpu, devices) – 18%
2. Environment (network, power) – 14%
3. Software (OS, applications) – 25%
4. Operations (maintenance, administration) – 42%
iv. When do failures occur?
1. Infant mortality – new, under tested
2. Normal lifetime – highly reliable
3. Wear-out period (for HW) – things break physically, or (for SW) assumptions about the world have changed too much
v. Failure models – Why important?
1. Timing failures occur when a component violates timing constraints.
2. Output or response failures occur when a component outputs an incorrect value.
3. Omission failures occur when a component fails to produce an expected output.
4. Crash failures occur when the component stops producing any outputs.
5. Byzantine or arbitrary failures occur when any other behavior, including malicious behavior, occurs
vi. Synthetic failure models
1. Halt on failure
2. Failure status
3. Stable Storage
vii.
i. Fault Avoidance: make sure failures don't happen
1. Fault prevention: write code without bugs
a. better languages
b. better software engineering
c. tool usage during coding process
d. e.g. write a new OS in a new language, prove properties of implementation
2. Fault removal: remove bugs from code
a. e.g. run testing tool (valgrind, purify)
b. windows static driver verifier – find bugs statically
3. Fault workaround: make sure faults don't execute
a. Firewall / virus detector
b. "It hurts when I run" / "Don't run"
ii. Fault Tolerance
1. Allow failures to occur, but keep system running
2. Basic ideas:
a. Fault detection – figure out that something bad happened
b. Isolation – keep bad state from spreading to whole system
c. Recovery – get the bad part back into a good state
3. Basic approaches to error detection
a. Check dynamically for error conditions and inconsistencies to detect failures early
b. Use heart beats to make sure a module is still executing
c. QUESTION: how easy is it to do this generically?
i. QUESTION: as code evolves?
ii. QUESTION: at what cost?
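A heartbeat detector can be sketched in a few lines. The class and method names below are hypothetical, and the clock is injectable so the behavior can be tested without sleeping. Note what it does not detect: a module that keeps beating while producing wrong output.

```python
import time


class HeartbeatMonitor:
    """Declares a module failed if no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock      # injectable for testing (an assumption of this sketch)
        self.last_seen = {}     # module name -> time of last heartbeat

    def beat(self, module):
        """Called by (or on behalf of) a module to show it is still executing."""
        self.last_seen[module] = self.clock()

    def failed(self):
        """Return the sorted names of modules whose last heartbeat is too old."""
        now = self.clock()
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout)
```

The timeout sets the trade-off asked about above: a short timeout detects crashes quickly (small MTTR) but risks declaring a slow module dead.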
4. Basic approaches to isolation
a. Decompose into modules
i. Unit of failure is small
b. Check each module for errors
i. Fails fast – doesn't spread corruption
ii. Isolate from other modules
c. Hardware / software boundaries around modules
i. Whole machine
ii. address space
iii. extra instructions
5. Basic approaches to recovery
a. Restore system to a functioning state
i. E.g. configure extra modules to take over for failed module, restart failed module
b. Forwards / Backwards
c. Concealing / revealing
d. Basic approaches:
i. Logging / retry
ii. Checkpoint / restore
iii. Replicate (process pairs)
iv. Alternate versions
v. Transactions (undo)
vi. Reveal faults up the stack
e. Concepts:
i. Have multiple Ys, Multiple Xs that are identical. Switch between Xs when Ys fail
1. Fault Tolerance
ii. Isolate X from Y so survival of X does not depend on Y
1. Fault Containment
2. Some useful things fail, but not all - partitioning
f. Redundancy: do things twice or more
i. On two machines
ii. In two processes
iii. In two places (state in memory / on disk checkpoint)
iv. At two times (e.g. checkpoint / restore)
v. QUESTION: what kinds of bugs are handled?
g. Diversity: do things multiple different ways
i. Different platforms
ii. Different implementations
iii. Idea: unlikely to have common failure modes
iv. Name: n-version programming, recovery blocks
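As an illustration of diversity, a recovery block tries alternate implementations in order and accepts the first result that passes an acceptance test. The sketch below is a simplified version of the idea: it omits the state rollback between alternates that a full recovery block performs, and `buggy_sqrt` is a made-up faulty primary.

```python
import math


def recovery_block(acceptance_test, alternates, *args):
    """Run alternates in order; return the first result that passes the test."""
    for alternate in alternates:
        try:
            result = alternate(*args)
        except Exception:
            continue  # a crash in one version is treated like a failed test
        if acceptance_test(result, *args):
            return result
    raise RuntimeError("all alternates failed the acceptance test")


def buggy_sqrt(x):
    """Hypothetical faulty primary version: wrong for most inputs."""
    return x / 2


def sqrt_ok(result, x):
    """Acceptance test: squaring the result must give back the input."""
    return abs(result * result - x) < 1e-9


if __name__ == "__main__":
    print(recovery_block(sqrt_ok, [buggy_sqrt, math.sqrt], 9.0))
```

This only helps if the versions do not share a failure mode and if the acceptance test is itself correct, which is the weak point of the approach.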
6. Basic questions for fault tolerance: where do you do the fault tolerance?
a. In the hardware (e.g. two processors, RAID with multiple disks)
b. Between the HW and the OS (e.g. virtual machine)
c. Within the OS
d. Between the OS and the application
e. Within the application
7. General principle:
a. If everything above layer X is identical, can tolerate faults at X or below automatically
i. E.g. FT unix -> HW, OS faults
ii. E.g. Hypervisor -> HW faults
iii. E.g. Nooks -> driver faults (everything else is above)
iv. E.g. Disco -> OS faults
b. If have some diversity above X, can tolerate heisenbugs above layer X
i. Process pairs – execute different streams
ii. Checkpoint / restart: if restart far enough back
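Checkpoint / restart can be sketched as follows: save a copy of the state, run the step on a working copy, and on failure discard the working copy and retry from the checkpoint. The function name and the dict-based state are assumptions made for illustration.

```python
import copy


def run_with_checkpoint(state, step, max_retries=3):
    """Run step(state_copy); on an exception, restart from the checkpoint."""
    checkpoint = copy.deepcopy(state)          # the known good state
    for _ in range(max_retries + 1):
        working = copy.deepcopy(checkpoint)    # restart from the checkpoint
        try:
            step(working)
            return working                     # success: this is the new state
        except Exception:
            continue                           # partial effects are discarded
    raise RuntimeError("step failed on every retry; likely a deterministic bug")
```

Retrying helps with Heisenbugs because the second execution usually sees different timing or events. A Bohr bug fails identically every time, so the retry limit is reached.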
i. Not much additional cost over unreliable
i. Not much additional hardware or software
i. Can make existing programs / os more reliable
i. Hardware
ii. Software
iii. human
i. Run two copies
ii. Switch from one to the other on failure of one
i. Lockstep processes – HW failures only
1. Both CPU do same work, no extra capacity
ii. Explicit State checkpoints – do computation, send state changes to backup
1. Backup can do computations from latest state
iii. Automatic checkpoints – log messages
1. Inefficient – don't know what to checkpoint, must send everything
iv. Delta checkpoints – send operations, not state
1. re-execute on other side. Reduces bandwidth
v. Persistent processes – only replicate persistent data and session existence, not transient per-session data – internal in-memory data structures
1. Make state changes persistent: e.g. all on disk
2. On failure, backup wakes up knowing sessions but not state
3. On failure, internal state is in unknown, inconsistent situation
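The delta-checkpoint variant can be sketched as a primary that forwards each operation to a backup, which re-executes it. This is a toy, single-process sketch: the class names and the operation format are invented, and a direct method call stands in for the message link between the two processes.

```python
class Replica:
    """Holds state and applies operations deterministically."""

    def __init__(self):
        self.state = {}

    def apply(self, op):
        kind, key, value = op
        if kind == "set":
            self.state[key] = value
        elif kind == "add":
            self.state[key] = self.state.get(key, 0) + value
        else:
            raise ValueError(f"unknown operation: {kind}")


class ProcessPair:
    """Primary executes each operation and ships it to the backup."""

    def __init__(self):
        self.primary = Replica()
        self.backup = Replica()

    def execute(self, op):
        self.primary.apply(op)
        if self.backup is not None:
            self.backup.apply(op)  # stands in for sending the op over a link

    def fail_over(self):
        """Promote the backup; a real system would also start a new backup."""
        self.primary, self.backup = self.backup, None
        return self.primary
```

Sending operations instead of state only works if re-execution is deterministic, so that the backup ends up in the same state as the primary.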
i. Group of operations that form a consistent transformation of state - ACID
1. Atomic – all or nothing
2. Consistent – the transaction is a correct state transformation
3. Isolated – every transaction sees a correct picture of the state, even if other transactions are executing concurrently
4. Durable – once a transaction commits, its effects survive even if a failure occurs afterwards
ii. Operations
1. Begin transaction
2. Commit – make effects durable
3. Abort – undo partial effects
iii. Use for fault tolerance
1. Allows use of persistent process pairs
a. Allows undo of actions in a transaction that aborted
b. Allows reset of system to known good state
iv. QUESTION: What is great about transactions?
1. Can reason about state of system with failures
v. QUESTION: Why not use them?
1. Programming cost
2. Performance cost – extra communication
vi. QUESTION: What is MTTR here?
1. Must detect failure
2. Backup must abort in-progress transactions
a. no state to sync or log to replay
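The begin / commit / abort operations can be sketched with an in-memory undo log. This covers atomicity only: durability would need the log on stable storage, and isolation would need locking, both left out here. All names are hypothetical.

```python
class TransactionalStore:
    """Key-value store with single-transaction undo (atomicity only)."""

    _MISSING = object()  # marks "key did not exist before this write"

    def __init__(self):
        self.data = {}
        self._undo = None  # list of (key, old value) while a transaction is open

    def begin(self):
        if self._undo is not None:
            raise RuntimeError("transaction already open")
        self._undo = []

    def write(self, key, value):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        self._undo.append((key, self.data.get(key, self._MISSING)))
        self.data[key] = value

    def commit(self):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        self._undo = None  # keep the effects, forget how to undo them

    def abort(self):
        if self._undo is None:
            raise RuntimeError("no open transaction")
        for key, old in reversed(self._undo):  # undo the newest write first
            if old is self._MISSING:
                self.data.pop(key, None)
            else:
                self.data[key] = old
        self._undo = None
```

After a failure, a backup that aborts every in-progress transaction this way is back in a known good state without replaying anything.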
i. Session abstraction
1. Sequenced
2. Retry on alternate path if path fails
3. Notify endpoints if all paths fail
4. Sessions handle switching to backup automatically if primary fails
5. On TX abort, sequence number reverts to beginning of transaction, intervening messages cancelled
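The session properties above can be sketched as follows. Paths are modeled as plain callables that raise ConnectionError when they fail, which is an assumption of this sketch, as are all the names. Cancelling the intervening messages at the receiver is not modeled; only the sender's sequence number is rewound.

```python
class AllPathsFailed(Exception):
    """Raised to notify the endpoint that no path could deliver the message."""


class Session:
    def __init__(self, paths):
        self.paths = paths      # callables path(seq, msg), tried in order
        self.seq = 0            # sequence number of the next message
        self._tx_start = None   # sequence number at begin_tx()

    def send(self, msg):
        for path in self.paths:
            try:
                path(self.seq, msg)
            except ConnectionError:
                continue        # retry the same message on an alternate path
            self.seq += 1
            return
        raise AllPathsFailed(f"message {self.seq} could not be delivered")

    def begin_tx(self):
        self._tx_start = self.seq

    def abort_tx(self):
        if self._tx_start is None:
            raise RuntimeError("no open transaction")
        self.seq = self._tx_start  # sequence number reverts to start of TX
        self._tx_start = None
```

Because retries reuse the same sequence number, the receiver can discard duplicates if a message arrives on more than one path.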
i. Store on multiple disks
ii. Many replication options – take 739 for details
iii. Transactions + logs for ensuring storage updated consistently